Cold-adapted Cretaceous polar turtle resolves the trans-Antarctic radiations of Testudinata
Benjamin P. Kear1*, Márton Rabi2,3, Martin Kundrát4, Michael S. Y. Lee5,6, Daniel Snitting7, Mohamad Bazzi8,9,Barbara E. Wagstaff10, Doris E. Seegets-Villiers11, Peter Trusler12, Thomas H. Rich12, Patricia Vickers-Rich11, & Lesley Kool11,12
1 The Museum of Evolution, Uppsala University, 752 36 Uppsala, Sweden. 2 Department of Geosciences, University of Tübingen, Hölderlinstraße 12, D-72074, Tübingen, Germany. 3 Natural Sciences Collections (ZNS), Martin-Luther University Halle-Wittenberg, Domplatz 4, D-06108, Halle (Saale), Germany. 4 Center for Interdisciplinary Biosciences, Technology and Innovation Park, University of Pavol Jozef Šafárik, Košice 04154, Slovakia. 5 School of Biological Sciences, Flinders University, Adelaide 5001, Australia. 6 South Australian Museum, North Terrace, Adelaide 5000, Australia. 7 Department of Organismal Biology, Uppsala University, 752 36 Uppsala, Sweden. 8 KTH, Royal Institute of Technology, Osquars backe 31, 114 28 Stockholm, Sweden. 9 Paleontological Institute and Museum, University of Zürich, Zürich CH‑8006, Switzerland. 10 School of Earth Sciences, University of Melbourne, Victoria 3010, Australia. 11 School of Earth, Atmosphere and Environment, Monash University, Melbourne, Victoria 3800, Australia. 12 Museums Victoria, PO Box 666, Melbourne, Victoria 3001, Australia.
Workflow for reproducible data handling, statistical analysis, and reporting.
Docker workflow
We use Docker to containerize and render our Quarto R file. A container is a running instance of an image that bundles the application (e.g., the programming language interpreter needed to run a workflow) together with the system libraries the application requires. This is a powerful tool that facilitates persistent reproducibility of analytical work. Our containerized working environment can be accessed via Docker Hub.
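For illustration, a minimal Dockerfile of the kind described might look like the following; the base image, package list, and file names here are assumptions for the sketch, not the project's actual configuration:

```dockerfile
# Hypothetical example: pin a versioned R base image for reproducibility.
FROM rocker/verse:4.3.1

# Install the R packages the workflow needs (illustrative subset).
RUN R -e "install.packages(c('VIM', 'mice', 'caret', 'nnet'), repos = 'https://cloud.r-project.org')"

# Copy the Quarto project into the image and render it on container start.
COPY . /home/analysis
WORKDIR /home/analysis
CMD ["quarto", "render", "analysis.qmd"]
```

Building this file produces an image (`docker build -t turtle-analysis .`), and each `docker run turtle-analysis` starts a fresh container from that image, matching the Dockerfile-to-image-to-container pipeline shown in the figure.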
Code
knitr::include_graphics(path = "docker_workflow.png")
Docker workflow. The process of creating containers starts with a Dockerfile, which acts as a set of instructions for constructing the computational environment. The Dockerfile is used to generate an image, and the image in turn serves as the foundation for launching one or more containers. Drawing inspired by Nüst et al. (2020), PLoS Computational Biology.
Exploratory Data Analysis
Dataset. The dataset used in this study consists of predictor variables representing length measurements, with habitat classification as the dependent variable of interest. These variables are used in the evaluation of model performance.
Code
# Data input.
Turtle.Df <- openxlsx::read.xlsx(xlsxFile = "data/measurement_file.xlsx")
# Change character class into factors.
Turtle.Df <- Turtle.Df |> dplyr::mutate_if(is.character, as.factor)
# Sample size: check for class imbalance.
Turtle.Df |> dplyr::count(Habitat) |>
  dplyr::mutate('%' = round(n / sum(n), 2)) |>
  knitr::kable() |>
  kable_styling(bootstrap_options = "striped", full_width = F)
Habitat        n     %
Aquatic       49  0.51
Semi-aquatic   9  0.09
Terrestrial   19  0.20
Unknown       19  0.20
Missingness. We used the naniar package v. 1.0.0 (Tierney et al. 2023) to explore missing data structures. Data missingness is moderate, with a total of 32 missing values out of 288 cells (11.1% missingness): 6 missing values for ‘Humerus length’, 10 for ‘Ulna length’, and 16 for ‘Manual Digit III length’.
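The headline missingness figures can be reproduced with a couple of base-R one-liners; the toy data frame below is a stand-in for the real measurement columns (the real file has 96 rows × 3 measurement variables = 288 cells):

```r
# Toy stand-in for the three measurement columns (12 cells, 3 of them NA).
toy <- data.frame(
  Humerus.Length = c(10.2, NA, 12.5, 9.8),
  Ulna.Length    = c(7.1, 6.9, NA, NA),
  Digit.III      = c(5.0, 4.8, 5.2, 5.1)
)
n_missing   <- sum(is.na(toy))                    # total NA cells
pct_missing <- round(mean(is.na(toy)) * 100, 1)   # percentage of all cells
n_missing    # 3
pct_missing  # 25
```

Applied to the full measurement table, the same two expressions yield the 32 cells and 11.1% reported above.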
Pattern of missingness by dependent variable (i.e., habitat).
Code
# Check for missing values.
sum(is.na(Turtle.Df[, 4:6]))
# By column.
sapply(Turtle.Df, function(x) sum(is.na(x)))
# What's the percentage rate of missingness?
paste0(round(mean(is.na(Turtle.Df[, 4:6])), digits = 3) * 100, '%')
Test of normality. The Shapiro-Wilk test was used to assess the normality of the distribution of the predictor variables. The P-values obtained from the test were below the chosen significance level, so the null hypothesis (H0) of normality was rejected for the predictor variables. This was further supported by Quantile-Quantile (Q-Q) plots, which showed a departure from the normal distribution, specifically a right-skewed distribution for the un-transformed variables.
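As a hedged illustration of this check (on simulated data, not the turtle measurements): a right-skewed log-normal sample, qualitatively similar to the raw length variables, fails the Shapiro-Wilk test, and a Q-Q plot makes the departure visible:

```r
set.seed(42)
x <- rlnorm(100, meanlog = 2, sdlog = 0.8)  # right-skewed, like raw lengths

# Shapiro-Wilk rejects normality for the skewed sample...
shapiro.test(x)$p.value < 0.05  # TRUE

# ...and a Q-Q plot shows the systematic departure from the reference line.
qqnorm(x); qqline(x)
```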
Code
# Is the distribution of the data normal? Limited to extant complete specimens.
apply(X = Turtle.Df[1:77, 4:6], MARGIN = 2, FUN = shapiro.test) |>
  lapply(function(x) data.frame(statistic = x$statistic, p.value = x$p.value)) |>
  bind_rows() |>
  tibble::rownames_to_column(var = "Variable") |>
  mutate(Variable = case_when(Variable == "W...1" ~ "Humerus.Length",
                              Variable == "W...2" ~ "Ulna.Length",
                              Variable == "W...3" ~ "Digit.III",
                              TRUE ~ Variable)) |>
  knitr::kable() |>
  kable_styling(bootstrap_options = "striped", full_width = F)
Normality test, data distribution, and Q-Q plots. Results reveal an overall right-skewness for each un-transformed numeric variable. P-values are based on a Shapiro-Wilk normality test.
Box-and-whisker plot visualization: inspection of potential outliers by categorical variable.
Code
# Set row-names.
rownames(Turtle.Df) <- paste(Turtle.Df$Species, Turtle.Df$Specimen)
Code
# Five-number summary method.
out_var <- lapply(Turtle.Df[, sapply(Turtle.Df, is.numeric)], boxplot.stats)
# Rownames of the outliers.
out_list <- sapply(Turtle.Df[, sapply(Turtle.Df, is.numeric)], function(x) {
  stats <- boxplot.stats(x)
  rownames(Turtle.Df)[which(x %in% stats$out)]
})
Outlier removal
Code
# Extant turtle data.
extant.Df <- Turtle.Df[1:77, 4:7]
# Clean with outliers removed using the IQR method: N = 67.
clean.Df <- remove_outliers(extant.Df, c("Humerus.Length", "Ulna.Length", "Digit.III"))
# Omitted taxa: N = 10.
omit.Df <- extant.Df[rownames(extant.Df) %in%
                       generics::setdiff(rownames(extant.Df), rownames(clean.Df)), ]
# Create clean dataframe with extinct species included: N = 86.
clean.Df <- rbind(clean.Df, Turtle.Df[78:96, 4:7])
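The `remove_outliers()` helper is called above but not defined in this chunk. A minimal sketch of an IQR-based version is shown below; this is an assumption about its behaviour (drop rows outside the 1.5 × IQR fences per named column), not the authors' exact implementation:

```r
# Drop rows where any of the named columns falls outside 1.5 * IQR of that column.
remove_outliers <- function(df, cols) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    q <- quantile(df[[col]], probs = c(0.25, 0.75), na.rm = TRUE)
    iqr <- q[2] - q[1]
    in_range <- df[[col]] >= q[1] - 1.5 * iqr & df[[col]] <= q[2] + 1.5 * iqr
    keep <- keep & (in_range | is.na(df[[col]]))  # keep NAs for later imputation
  }
  df[keep, , drop = FALSE]
}

toy <- data.frame(Humerus.Length = c(10, 11, 12, 13, 100))  # 100 is an outlier
nrow(remove_outliers(toy, "Humerus.Length"))  # 4
```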
Single and multiple imputation
The imputation of missing measurements was carried out using the iterative robust model-based imputation (irmi) method implemented in the VIM R package (version 6.2.2) (Templ et al., 2011; Kowarik and Templ, 2016). In total, 11 imputed sets (m) were created, matching the proportion of missing cases (i.e., 11%). We then inspected the within-imputation variance and computed 95% confidence intervals to determine the consistency of variable estimation. As a reliability test, we generated missing values at random in the extant portion of the turtle data at thresholds ranging from 20 to 50% missingness. Using these datasets, we estimated all randomly generated missing values with the same approach (i.e., irmi) and compared the results against the actual (true) measurements. Finally, we computed and compared imputation results obtained using multiple imputation by chained equations with the mice R package (version 3.15.0) (van Buuren and Groothuis-Oudshoorn, 2011).
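The within-imputation consistency check can be illustrated on toy numbers: for one taxon's m = 11 imputed values of a trait, a normal-approximation 95% confidence interval around the mean summarizes the spread across imputations. The values below are hypothetical; the real estimates come from VIM::irmi:

```r
# Hypothetical m = 11 imputed humerus lengths for one fossil taxon.
imp_vals <- c(41.2, 40.8, 41.5, 42.0, 41.1, 40.9, 41.7, 41.3, 41.0, 41.6, 41.4)
m  <- length(imp_vals)
mu <- mean(imp_vals)
se <- sd(imp_vals) / sqrt(m)
# Two-sided 95% CI based on the t distribution.
ci <- mu + c(-1, 1) * qt(0.975, df = m - 1) * se
round(ci, 2)
```

A narrow interval (relative to the trait's scale) indicates that the m imputations agree on the estimate; a wide one flags an unstable imputation for that taxon.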
Code
# Single IRMI.
set.seed(1)
imp.1 <- VIM::irmi(x = clean.Df[, 1:3], robust = TRUE, imp_var = F, trace = T)
imp.1 <- round(imp.1, digits = 2)
# Multiple IRMI.
set.seed(2)
imp.2 <- VIM::irmi(x = clean.Df[, 1:3], robust = TRUE, mi = 11, imp_var = F)
# Any missing values?
sapply(imp.2, function(x) any(is.na(x)))
# Add imputed values as a unique column to dataframe.
colnames(imp.1) <- c("Irmi.Humerus", "Irmi.Ulna", "Irmi.Digit")
# Merge.
compFin.Df <- cbind(clean.Df, imp.1)
# Add identifier variable.
Measurement <- as.factor(c(rep("Raw", 67), rep("Imputed", 17), rep("Raw", 2)))
# New factor.
compFin.Df$Measurement <- Measurement
# Species names.
species_name <- rownames(clean.Df)[68:86]
Imputation consistency
Code
# Pull out extinct taxa that have been estimated.
irmi_extinct <- lapply(imp.2, FUN = slice, 68:86)
# Function to add rownames and new column.
w <- function(x) {
  row.names(x) <- as.character(rownames(clean.Df[68:86, ]))
  cbind(x, taxa = as.factor(rownames(clean.Df[68:86, ])))
}
irmi_var <- lapply(irmi_extinct, FUN = w)
# Combine data frames.
irmi_multiple <- irmi_var |> bind_rows(.id = "id")
irmi_multiple |>
  tidyr::pivot_longer(cols = -c(id, taxa), names_to = "variable", values_to = "value") |>
  ggplot(aes(x = taxa, y = value, color = variable)) +
  geom_point() +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
  stat_summary(fun = mean, geom = "point", fill = "grey70", color = "black") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.2, color = "black") +
  facet_wrap(~ variable, scales = "free") +
  labs(y = "Value", x = "",
       title = "Imputation variance across IRMI-computed trait dataset",
       subtitle = "95% confidence intervals") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 2),
        aspect.ratio = 1, legend.position = "none",
        plot.title = element_text(face = "bold"))
Imputation performance
Code
# Randomly introduce missing values.
pM.extant <- clean.Df[1:67, 1:3]
generate_random_NA <- function(df, missing_prob = 0.1) {
  # Copy.
  df_NA <- df
  # Loop through each column of the dataframe.
  for (col in 1:ncol(df_NA)) {
    # Generate a random number between 0 and 1 for each element in the column.
    rand_nums <- runif(nrow(df_NA))
    # Check if each element should be replaced with NA based on the missing probability.
    for (i in 1:nrow(df_NA)) {
      if (rand_nums[i] < missing_prob) {
        # Skip if all other elements of the row are already NA.
        if (sum(is.na(df_NA[i, ])) == ncol(df_NA) - 1) next
        df_NA[i, col] <- NA
      }
    }
  }
  return(df_NA)
}
set.seed(3)
# Create missingness.
Missing_20 <- generate_random_NA(pM.extant, missing_prob = 0.2)
Missing_30 <- generate_random_NA(pM.extant, missing_prob = 0.3)
Missing_40 <- generate_random_NA(pM.extant, missing_prob = 0.4)
Missing_50 <- generate_random_NA(pM.extant, missing_prob = 0.5)
# List.
random_list <- list(Missing_20, Missing_30, Missing_40, Missing_50)
# Run IRMI imputation on all data frames and save results.
impute_list <- vector(mode = 'list', length = 4)
set.seed(4)
# Run imputation on all m copies.
for (i in seq_along(random_list)) {
  impute_res <- irmi(random_list[[i]], robust = FALSE, imp_var = T)
  impute_list[[i]] <- impute_res
}
# Correlation test (Humerus length).
Humerus.Path <- "data-raw/imputation_performance_humerus.xlsx"
Humerus_list <- list()
# Import data file.
for (i in c(1:4)) {
  Humerus.Data <- read.xlsx(Humerus.Path, i)
  Humerus_list[[i]] <- Humerus.Data
}
cor_humerus_list <- lapply(Humerus_list, function(x) cor(x[, "Humerus.Imputed"], x[, "Humerus.Acctual"]))
# Correlation test (Ulna length).
Ulna.Path <- "data-raw/imputation_performance_ulna.xlsx"
Ulna_list <- list()
for (i in c(1:4)) {
  Ulna.Data <- read.xlsx(Ulna.Path, i)
  Ulna_list[[i]] <- Ulna.Data
}
cor_ulna_list <- lapply(Ulna_list, function(x) cor(x[, "Ulna.Imputed"], x[, "Ulna.Acctual"]))
# Correlation test (Digit III length).
Digit.Path <- "data-raw/imputation_performance_digit.xlsx"
Digit_list <- list()
for (i in c(1:4)) {
  Digit.Data <- read.xlsx(Digit.Path, i)
  Digit_list[[i]] <- Digit.Data
}
cor_digit_list <- lapply(Digit_list, function(x) cor(x[, "Digit.Imputed"], x[, "Digit.Acctual"]))
Results:
Code
# Graphic layout. Save current graphical parameters so they can be restored.
op <- par(no.readonly = TRUE)
layout(matrix(c(1:3), nrow = 1, ncol = 3, byrow = FALSE),
       heights = c(2.5, 2.5, 2.5), widths = c(2, 2, 2), respect = TRUE)
irmi.cols <- c(5:7)
for (i in irmi.cols) {
  if (is.numeric(compFin.Df[, i]))
    plot(compFin.Df[, i], col = c("#1c4966", "#b11226")[compFin.Df$Measurement],
         pch = 21, bg = "white", ylab = colnames(compFin.Df)[i], font.lab = 2,
         sub = "Median represented by solid lines")
  abline(h = median(compFin.Df[compFin.Df$Measurement == "Raw", i]), col = "#b11226")
  abline(h = median(compFin.Df[compFin.Df$Measurement == "Imputed", i]), col = "#1c4966")
  # legend("topleft", legend = c("Raw", "Imputed"),
  #        col = c("#b11226", "#1c4966"), lty = 1, lwd = 2, border = FALSE,
  #        bty = "n", cex = .3)
}
# Add title.
mtext(text = "Bivariate plots comparing observed and imputed values per variable",
      side = 3, line = 2, at = -90, cex = 1, font = 2)
par(op)
Mice
Code
# Object contains 11 imputed dataset(s).
# Predictive mean matching.
imp.3 <- mice(data = clean.Df[, 1:3], m = 11, method = "pmm", seed = 1991,
              print = T, remove.collinear = F, printFlag = FALSE)
# Check predictor matrix.
imp.3$predictorMatrix
# Diagnostics.
# stripplot(imp.3, pch = c(1, 20))
# Check correlation between variables.
with(imp.3, cor(Humerus.Length, Ulna.Length))
with(imp.3, cor(Humerus.Length, Digit.III))
with(imp.3, cor(Ulna.Length, Digit.III))
# Are there any negative values?
sum(imp.3$imp$Humerus.Length < 0)
sum(imp.3$imp$Ulna.Length < 0)
sum(imp.3$imp$Digit.III < 0)
# Complete mice dataset as list.
complete_mice <- mice::complete(data = imp.3, "all", inc = FALSE)
Code
# Add new species column.
taxa_labels <- rownames(complete_mice[[1]])
complete_mice <- lapply(complete_mice, function(x) cbind(x, taxa = as.factor(taxa_labels)))
Phylogenetic Imputation
We used the Rphylopars (v. 0.3.10) R package (Goolsby et al. 2016) to estimate missing data while accounting for both within-species variation and the evolutionary relationships between species. Prior to phylogenetic imputation, we first pruned our tree topology using functions from the ape R package to match the taxa in our trait dataset. The final tree and trait data included four fossil taxa (Meiolania platyceps, Strzeleckemys bunurongi, Spoochelys ormondea, Palaeochersis talampayensis) and 17 extant taxa (Emys orbicularis, Trachemys scripta, Chrysemys picta, Platysternon megacephalum, Gopherus polyphemus, Dermochelys coriacea, Chelonia mydas, Caretta caretta, Sternotherus odoratus, Kinosternon flavescens, Chelydra serpentina, Pelodiscus sinensis, Lissemys punctata, Carettochelys insculpta, Podocnemis expansa, Chelodina longicollis, Chelus fimbriatus). We log-10 transformed our predictor variables, ran the phylogenetic imputations, and converted both raw and estimated values back to metric scale. The imputed values were subsequently compared against the IRMI-generated estimates.
Result
Correlation tests show that trait data imputed using phylogenetic imputation versus iterative robust model-based imputation converge on similar estimates. Some minor variation is, however, indicated for the Digit III variable (correlation coefficient = 0.81). For additional details, see Discriminant analysis.
Code
names(pruned_data)[1] <- "species"
names(pruned_data)[2] <- "V1"
names(pruned_data)[3] <- "V2"
names(pruned_data)[4] <- "V3"
# Numeric variables need to be log10 transformed.
pruned_data <- rapply(pruned_data, f = log10, classes = c("numeric", "integer"), how = "replace")
pdc <- pruned_data[, -5]
# Create list.
imp.PhyloData <- list(pdc, pruned_tree)
names(imp.PhyloData)[[1]] <- "trait"
names(imp.PhyloData)[[2]] <- "tree"
# Phylogenetic imputation using phylopars.
PPE <- phylopars(trait_data = imp.PhyloData$trait, tree = imp.PhyloData$tree)
# Transform units back to centimeter measurements.
PPE$ind_recon[, 2:4] <- 10^PPE$ind_recon[, 2:4]
# Imputed fossil taxa.
pImp <- PPE$ind_recon[c(22:35), ] |>
  as_tibble() |>
  as.data.frame() |>
  dplyr::rename("taxa" = "trait_data.species")
# Original specimen names.
os_name <- grep(pattern = "Spoochely|Strzeleckemys|Meiolania platyceps",
                x = rownames(Turtle.Df), value = T)
# Subset IRMI dataset to match taxa in the phylopars dataset.
irmi_subset <- lapply(irmi_var, function(x) x[os_name, , drop = FALSE])
The comp_list includes all observations for which habitat is recorded. This also means that every dataset in this list is identical across imputations.
The miss_list includes all observations for which habitat was not recorded.
Code
# Extant raw data.
o1 <- lapply(imp.2, FUN = slice, 1:67)
# Extinct imputed data: habitat designated observations as 'Unknown' (N = 19).
o2 <- lapply(imp.2, FUN = slice, 68:86)
# Filter and pre-process data list.
comp_list <- lapply(o1, function(x) {
  subset <- clean.Df[1:67, "Habitat"]
  # Add habitat back to the list.
  x <- cbind(x, Habitat = as.factor(droplevels(subset)))
  # Reorder variables.
  x <- x[, c("Habitat", "Humerus.Length", "Ulna.Length", "Digit.III")]
})
miss_list <- o2
# Remove taxon label.
# comp_list <- lapply(complete_list, function(x){x["taxa"] <- NULL; x})
Multicollinearity
Code
# Extract the first dataset from the comp_list: 67 observations.
all.Df <- comp_list[[1]]
# Average and standard deviation.
knitr::kable(data.frame(mean = apply(all.Df[, 2:4], 2, mean),
                        sd = apply(all.Df[, 2:4], 2, sd))) |>
  kable_styling(bootstrap_options = "striped")
                    mean         sd
Humerus.Length  36.35522  13.110305
Ulna.Length     23.63582   9.012932
Digit.III       23.00000  10.681406
Correlation network
Code
# Final check of multicollinearity.
var_names <- c("Humerus", "Ulna", "Digit III")
qgraph(abs(cor(sapply(all.Df[, 2:4], as.numeric), method = "spearman")),
       graph = "pcor", layout = "spring", vsize = 6,
       nodeNames = var_names, legend.cex = 0.4)
Correlation network
Variance inflation factor index
Code
# NB: VIF > 10 is considered to be problematic (see Greene, W. 2012).
car::vif(lm(as.integer(Habitat) ~ Humerus.Length + Ulna.Length + Digit.III, data = all.Df))
car::vif(lm(as.integer(Habitat) ~ Humerus.Length + Digit.III, data = all.Df))
car::vif(lm(as.integer(Habitat) ~ Ulna.Length + Digit.III, data = all.Df))
# Durbin-Watson test.
lmtest::dwtest(lm(as.integer(Habitat) ~ Humerus.Length + Ulna.Length + Digit.III, data = all.Df))
lmtest::dwtest(lm(as.integer(Habitat) ~ Humerus.Length + Digit.III, data = all.Df))
lmtest::dwtest(lm(as.integer(Habitat) ~ Ulna.Length + Digit.III, data = all.Df))
Side-by-side box-and-whisker plot of anatomical descriptors. Data have been pre-processed and do not include influential observations identified by the IQR criterion. Confidence intervals obtained via non-parametric bootstrapping.
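The non-parametric bootstrap behind those confidence intervals can be sketched in base R: resample the observed values with replacement many times and take the percentiles of the resampled statistic. The sample below is invented for illustration:

```r
# Percentile bootstrap 95% CI for the mean of a toy length sample.
set.seed(1)
x <- c(20.1, 22.4, 19.8, 25.3, 21.7, 23.9, 20.5, 24.2, 22.8, 21.1)

# 2000 resamples with replacement; record the mean of each.
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# The 2.5% and 97.5% percentiles bracket the mean without any normality assumption.
ci <- quantile(boot_means, probs = c(0.025, 0.975))
round(ci, 2)
```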
Multinomial logistic regression
Workflow
We first checked for near-zero or zero variance in any of our predictor variables.
The variance inflation factor and a Durbin-Watson test (see Variance inflation factor index) confirm the presence of multicollinearity. The redundancy introduced by the near-exact collinearity between the ‘Humerus’ and ‘Ulna’ length variables impacts the MLR coefficient estimation. Combining these two variables into one did not solve the problem either. We therefore dropped the ‘Ulna’ variable from our classification analysis.
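The VIF diagnostic above can be illustrated without the car package, since for a given predictor VIF = 1 / (1 - R²), where R² comes from regressing that predictor on the others. The simulated pair below (means and spreads loosely echoing the humerus summary table, but otherwise invented) shows how near-exact collinearity pushes the VIF far past the conventional threshold of 10:

```r
set.seed(7)
humerus <- rnorm(67, mean = 36, sd = 13)
ulna    <- 0.65 * humerus + rnorm(67, sd = 0.5)  # nearly a linear copy of humerus

# VIF for a predictor = 1 / (1 - R^2) from regressing it on the other predictor(s).
r2  <- summary(lm(humerus ~ ulna))$r.squared
vif <- 1 / (1 - r2)
vif > 10  # TRUE: the pair is problematically collinear
```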
We fitted an MLR model using the full dataset (i.e., also referred to here as the training set) for which habitat IDs were available to estimate nominal outcome variables.
This dataset has already been pre-processed, with influential outlier points omitted.
The analysis was performed on the raw-data and not on log-10 transformed variables.
The Aquatic class was used as the baseline category.
We validated the accuracy of this model by examining the match between predicted vs. actual outcomes.
We also computed other metrics (e.g., Kappa, Precision, Recall, and F-Measure score).
We also conducted a drop-in-deviance test to determine the influence of potential non-informative predictors on our main model.
We proceeded to visualize the prediction results using stacked bar plots with the probabilities of each category shown by taxa.
We proceeded to estimate response probabilities on the IRMI datasets (both single and multiple). We summarized the probabilities across all iterations and visualized the results using pie graphs.
Finally, we explored SMOTE over-sampling to test the effect of class imbalance. Accordingly, we re-fitted a new series of multinomial logistic models and evaluated their respective accuracy.
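The accuracy and Kappa metrics named in the steps above can be computed by hand from a confusion matrix; the counts below are invented for illustration, not the study's results:

```r
# Hypothetical confusion matrix: rows = predicted, columns = actual.
cm <- matrix(c(33, 2, 1,
                2, 4, 1,
                1, 1, 13),
             nrow = 3, byrow = TRUE,
             dimnames = list(pred   = c("Aquatic", "Semi-aquatic", "Terrestrial"),
                             actual = c("Aquatic", "Semi-aquatic", "Terrestrial")))
n        <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Cohen's kappa corrects accuracy for agreement expected by chance
# given the marginal class frequencies.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)
round(c(accuracy = accuracy, kappa = kappa), 3)
```

caret's confusionMatrix() reports the same two quantities (along with per-class precision, recall, and F1) directly from predicted and observed factors.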
Model terms
Habitat = Unordered nominal (i.e., categorical or polytomous ‘dependent/response’ variable).
Digit III Length = numeric independent/predictor variable.
Code
# Check for zero-variance.
nearZeroVar(all.Df[2:4], saveMetrics = T) |> smoothScatter()
Code
nearZeroVar(all.Df[c(2, 4)], saveMetrics = T)
# Drop Ulna from the dataframe.
Final.Df <- all.Df[c(1:2, 4)]
# Training and testing for the purpose of cross-validation.
set.seed(123)
idx <- caret::createDataPartition(Final.Df$Habitat, p = 0.7, list = FALSE)
train_data <- Final.Df[idx, ]   # 48
test_data  <- Final.Df[-idx, ]  # 19
# Null (or empty) multinomial model. By default a baseline/reference level is
# selected alphabetically. The logistic coefficient is the expected amount of
# change in the logit for each one-unit change in the predictor.
set.seed(1995)
mod.intercept <- multinom(formula = Habitat ~ 1, data = train_data, Hess = T, model = T)
summary(mod.intercept)$coefficients
# Table.
stargazer(mod.intercept, type = "text", out = "intercept-only-model.htm",
          title = "Estimated Parameters (and Standard Errors) from intercept-only multinomial logistic regression")
# Full multinomial model.
# Does the inclusion of predictor variables improve the model fit? Yes.
# By default, coefficients are interpreted as relative log-odds.
summary(mod.3 <- multinom(Habitat ~ ., data = train_data, Hess = T, model = T)) |>
  magrittr::extract('coefficients')
# Exponentiating the slope coefficients gives odds ratios (i.e., relative risks),
# relative to the baseline category.
data.frame(t(exp(summary(mod.3)$coefficients))) |> round(digits = 3)
# Relative risks.
relative.risks <- exp(coef(mod.3))
# Table along with p-value statistics for predictor variables.
stargazer(mod.3, type = "text", coef = list(relative.risks), p.auto = FALSE,
          out = "multinomial-model.htm",
          title = "Estimated Parameters (and Standard Errors) from multinomial logistic regression with predictor variables",
          column.labels = c("vs. Aquatic", "vs. Aquatic"))
# Two-tailed z test.
z <- summary(mod.3)$coefficients / summary(mod.3)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2
# Predictors are all significant.
# Convert odds ratios into proportions.
2.94 / (1 + 2.94)
0.19 / (1 + 0.19)
7.12 / (1 + 7.12)
0.03 / (1 + 0.03)
Equation 1: Category B (Semi-aquatic) compared to Category A (Aquatic)
The log-odds of Y = Semi-aquatic versus Y = Aquatic increases by 1.08 for every one-unit change in humerus length. The corresponding odds ratio is 2.94, with a standard error of 0.44, a Wald-statistic of 2.45, and a significant p-value of 0.01.
Conversely, the log-odds of Y = Semi-aquatic versus Y = Aquatic decreases by 1.64 for every one-unit change in manus digit-III length. The corresponding odds ratio is 0.19, with a standard error of 0.67, a Wald-statistic of -2.44, and a significant p-value of 0.01.
Equation 2: Category C (Terrestrial) compared to Category A (Aquatic)
A one-unit increase in humerus length increases the log-odds of a turtle species being classified as Terrestrial rather than Aquatic by 1.96 (a proportion of 87.7%). The corresponding odds ratio is 7.12, with a standard error of 0.72, a Wald-statistic of 2.69, and a significant p-value of 0.007.
The log-odds of Y = Terrestrial versus Y = Aquatic decreases by 3.27 for every one-unit change in manus digit-III length. The corresponding odds ratio is 0.03, with a standard error of 1.22, a Wald-statistic of -2.67, and a significant p-value of 0.007.
Predicted probabilities
Expected probabilities for each level of the outcome variable based on the model coefficients.
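Under the hood, a multinomial model turns the linear predictors into class probabilities with a softmax over the baseline category. The sketch below hand-rolls that calculation; the slope values echo the coefficients reported above, but the intercepts and the specimen's measurements are invented for illustration:

```r
# Invented coefficient rows (intercept, Humerus.Length, Digit.III) for the two
# non-baseline classes; Aquatic is the baseline with linear predictor 0.
beta <- rbind("Semi-aquatic" = c(-2.0, 1.08, -1.64),
              "Terrestrial"  = c(-3.0, 1.96, -3.27))
x <- c(1, 30, 25)  # hypothetical specimen: intercept, humerus = 30, digit III = 25

eta   <- c(Aquatic = 0, beta %*% x)  # log-odds relative to Aquatic
probs <- exp(eta) / sum(exp(eta))    # softmax: probabilities sum to 1
round(probs, 4)
```

predict(mod.3, type = "prob") performs exactly this computation using the fitted coefficients.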
# Analysis of deviance:
# This is a goodness-of-fit test statistic (i.e., high deviance indicates a bad fit).
good.fit <- mod.intercept$deviance - mod.3$deviance
# The (non-central) chi-squared distribution.
pchisq(q = good.fit, df = mod.3$edf, lower.tail = FALSE)
# Goodness of fit (observed vs. expected).
ch.q <- chisq.test(x = train_data$Habitat, predict(mod.3), simulate.p.value = T)
# Hosmer and Lemeshow test (multinomial model).
# Pearson's chi-squared statistic, degrees of freedom, and significance.
# Null hypothesis is retained.
hosmer_lew <- suppressWarnings(logitgof(train_data$Habitat, fitted(mod.3)))
# Table.
data.frame(statistics = hosmer_lew$statistic,
           df = hosmer_lew$parameter,
           p = hosmer_lew$p.value) |>
  knitr::kable(caption = "Hosmer and Lemeshow test (multinomial model)") |>
  kable_styling(bootstrap_options = "striped")
Hosmer and Lemeshow test (multinomial model)
            statistics   df          p
X-squared     3.545558   16   0.999491
Pseudo R-Squares
Code
# Pseudo-R2.
DescTools::PseudoR2(x = mod.3, which = c("CoxSnell", "Nagelkerke", "McFadden"))
# Likelihood Ratio Tests: the LRT is the Chisq statistic.
LRT <- MASS::dropterm(mod.3, trace = FALSE, test = "Chisq")
LRT |> print()
# Influential variables.
var.imp <- varImp(mod.3)
var.imp$var <- rownames(var.imp)
var.imp |>
  ggplot(aes(x = var, y = Overall)) +
  geom_segment(aes(x = var, xend = var, y = 0, yend = Overall), color = "grey") +
  geom_point(color = "black", size = 4) +
  ylab("Influential variables") +
  xlab("") +
  theme_light() +
  theme(aspect.ratio = 1)
kable(stats::anova(mod.3, mod.4, test = "Chisq"), format = "markdown")
Model                        Resid. df   Resid. Dev   Test     Df   LR stat.   Pr(Chi)
Digit.III                            92     72.14784   NA       NA         NA        NA
Humerus.Length + Digit.III           90     24.39112   1 vs 2    2   47.75672         0
Repeated K-fold cross-validation
Code
# Perform cross-validation to select the best parameters.
fit.control <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 1000)
# Repeatability of results.
set.seed(1998)
fit.cv <- caret::train(Habitat ~ ., data = train_data, method = "multinom",
                       trControl = fit.control, verboseIter = TRUE, trace = FALSE)
# Classification table of MNL.
confusionMatrix(data = fit.cv)
# Mis-classification rate.
cat("Misclassification rate:", round((1 - 0.8329) * 100, 2), "%")
# Model validation/performance using the test set.
predictTest <- predict(mod.3, newdata = test_data, type = "class")
# F1-score, recall, and precision, etc.
performance.Df <- data.frame(obs = test_data$Habitat, pred = predictTest)
classes <- c("Aquatic", "Semi-aquatic", "Terrestrial")
# Threshold-dependent vs invariant metrics.
multiClassSummary(data = performance.Df, lev = classes)[c("Accuracy", "Kappa", "Mean_F1", "Mean_Precision", "Mean_Recall")]
Model comparison and selection
Code
set.seed(1991)
control <- trainControl(method = "cv", number = 10)
# Machine learning algorithms.
mod.5 <- train(Habitat ~ ., data = train_data, method = "multinom", trControl = control)
mod.6 <- train(Habitat ~ ., data = train_data, method = "rpart", trControl = control)
mod.7 <- train(Habitat ~ ., data = train_data, method = "knn", trControl = control)
mod.8 <- train(Habitat ~ ., data = train_data, method = "svmRadial", trControl = control)
mod.9 <- train(Habitat ~ ., data = train_data, method = "rf", trControl = control)
Code
# Results.
res <- resamples(list(multinom = mod.5, rpart = mod.6, knn = mod.7,
                      svmRadial = mod.8, rf = mod.9))
lattice::dotplot(res)  # the multinomial model was clearly the best choice
The predicted probabilities were pooled across imputations as

\hat{P}_{ij} = \frac{1}{m} \sum_{k=1}^{m} \hat{P}_{ijk},

where \hat{P}_{ij} is the pooled predicted probability of class j for observation i, \hat{P}_{ijk} is the predicted probability of class j for observation i in the kth imputed dataset, and m is the total number of imputed datasets.
Code
fossil.Pred <- vector("list", length = 11)
for (i in seq_along(miss_list)) {
  save.pred <- nnet:::predict.multinom(mod.3, newdata = miss_list[[i]], type = "prob")
  fossil.Pred[[i]] <- save.pred
}
# Pool the predicted probabilities across all imputed datasets.
pooled_preds <- Reduce(`+`, fossil.Pred) / length(fossil.Pred)
# Predicted class for each observation.
class_preds <- apply(pooled_preds, 1, which.max)
# Predicted probabilities.
pred_df <- data.frame(predicted_class = class_preds,
                      Aquatic = pooled_preds[, "Aquatic"],
                      Semiaquatic = pooled_preds[, "Semi-aquatic"],
                      Terrestrial = pooled_preds[, "Terrestrial"],
                      taxon = rownames(clean.Df)[68:86])
# Wide to long format.
pred_df_long <- tidyr::gather(pred_df, key = "habitat_type", value = "predicted_prob",
                              -predicted_class, -taxon)
# Probability distribution of imbalanced categories.
ggplot(pred_df_long, aes(x = "", y = predicted_prob, fill = habitat_type)) +
  geom_bar(stat = "identity", width = 1) +
  scale_fill_manual(values = c("#2D716F", "#2E2D29", "#B83A4B")) +
  coord_polar("y", start = 0) +
  facet_wrap(~ taxon, nrow = 3) +
  theme_void() +
  labs(x = "%", y = NULL, fill = "Habitat Type", title = "Fossil ecologies",
       subtitle = "Pooled predicted probabilities by class") +
  theme(strip.text = element_text(size = 4),
        plot.title = element_text(face = "bold")) +
  geom_text(data = pred_df_long,
            aes(label = paste0(round(predicted_prob * 100), "%")),
            position = position_stack(vjust = 0.5), size = 3) +
  geom_rect(data = subset(pred_df_long,
                          taxon %in% c("Strzeleckemys bunurongi NMV P208086a",
                                       "Strzeleckemys bunurongi NMV P208129",
                                       "Strzeleckemys bunurongi NMV P231470")),
            fill = NA, colour = "black",
            xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf)
Code
# Prediction on input data (phylogenetic imputed data).
pImDd <- pImp |>
  dplyr::select(-c(1, 3)) |>
  dplyr::rename("Humerus.Length" = "V1", "Digit.III" = "V3")
nnet:::predict.multinom(mod.3, newdata = pImDd, type = "prob") |>
  as.data.frame() |>
  mutate(Specimen = os_name) |>
  tidyr::pivot_longer(cols = c("Aquatic", "Semi-aquatic", "Terrestrial"),
                      names_to = "Habitat", values_to = "Predictions") |>
  ggplot(mapping = aes(x = "", y = Predictions, fill = Habitat)) +
  geom_bar(stat = "identity", width = 1, color = "#006B81") +
  labs(x = "%", y = NULL, title = "Fossil ecologies",
       subtitle = "Predicted probabilities based on phylogenetic imputed trait estimates") +
  scale_fill_manual(values = c("#2D716F", "#2E2D29", "#B83A4B")) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(Predictions * 100), "%")),
            position = position_stack(vjust = 0.5), size = 3, color = "white") +
  facet_wrap(~ Specimen, ncol = 5) +
  theme_void() +
  theme(strip.text = element_text(size = 5),
        plot.title = element_text(face = "bold"))
Smote (class imbalance)
SMOTE synthesizes new minority instances between existing minority records. After synthesizing new minority records, the imbalance shrinks from 7 to 21 semi-aquatic instances in the training set.
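The core SMOTE step can be sketched in a few lines: each synthetic record is a random interpolation between a minority sample and one of its k nearest minority-class neighbours. This is a hand-rolled illustration with invented coordinates, not the smotefamily implementation:

```r
set.seed(9)
# Two minority-class records in (humerus, digit III) measurement space.
x        <- c(Humerus.Length = 18.0, Digit.III = 9.5)
neighbor <- c(Humerus.Length = 21.0, Digit.III = 11.0)

# Synthetic point: step a random fraction of the way toward the neighbour.
gap       <- runif(1)
synthetic <- x + gap * (neighbor - x)

# The new record lies on the segment between the two originals.
all(synthetic >= pmin(x, neighbor) & synthetic <= pmax(x, neighbor))  # TRUE
```

Repeating this dup_size times per minority record (as in the SMOTE call below) multiplies the minority class without duplicating any observation exactly.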
Code
# Synthetic Minority Oversampling Technique.
train.smoote <- smotefamily::SMOTE(X = train_data[, 2:3], target = train_data$Habitat,
                                   K = 5, dup_size = 2)
# Multinomial model.
smote_nnet <- nnet::multinom(formula = class ~ ., data = train.smoote$data)
smote_nnet |> summary()
# Model accuracy on the hold-out set.
smote_nnet_predict <- predict(object = smote_nnet, newdata = test_data, type = "class")
# Confusion matrix.
confusionMatrix(test_data$Habitat, smote_nnet_predict)
# Pooled predicted probabilities.
IRMI.SMOTE <- vector("list", length = 11)
for (i in seq_along(miss_list)) {
  save.pred <- predict(smote_nnet, newdata = miss_list[[i]], type = "prob")
  IRMI.SMOTE[[i]] <- save.pred
}
# Average results and compare against the prediction based on unbalanced data.
pooled_smote <- Reduce(`+`, IRMI.SMOTE) / length(IRMI.SMOTE)
# Predicted class.
class_smote <- apply(pooled_smote, 1, which.max)